{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive bayes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "朴素贝叶斯是基于贝叶斯定理与特征条件的独立假设的分类方法，对于给定的训练数据集，首先基于特征条件独立假设学习输入\\输出的联合概率分布，然后基于此模型，对给定的输入x，利用贝叶斯定理求后验概率最大的输出y。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "import numpy as np\n",
    "from utils.metric import accuracy_score\n",
    "from utils.progress import normalize,train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 这里使用高斯函数来作为概率密度的计算，因为在实际生活中，很多情况的特征值是连续的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class NaiveBayes():   \n",
    "    \"\"\"The Gaussian Naive Bayes classifier. \"\"\"\n",
    "    def fit(self, X, y):\n",
    "        \n",
    "        \"\"\"\n",
    "        X [shape,features]\n",
    "        y [shape,label]\n",
    "        \"\"\"\n",
    "        \n",
    "        self.X, self.y = X, y\n",
    "        self.classes = np.unique(y)\n",
    "        self.parameters = []\n",
    "        # 计算每一个类别的每一个特征的方差和均值\n",
    "        for i, c in enumerate(self.classes):\n",
    "            X_where_c = X[np.where(y == c)]\n",
    "            self.parameters.append([])\n",
    "            # 计算每一个特征\n",
    "            for j in range(X.shape[1]):\n",
    "                col = X_where_c[:, j] #列\n",
    "                parameters = {\"mean\": col.mean(), \"var\": col.var()} #求方差 与 均值\n",
    "                self.parameters[i].append(parameters)\n",
    "    \n",
    "    def _calculate_likelihood(self, mean, var, x):\n",
    "        \"\"\" 计算高斯概率密度 输入均值 和 方差\"\"\"\n",
    "        eps = 1e-4 # Added in denominator to prevent division by zero\n",
    "        coeff = 1.0 / math.sqrt(2.0 * math.pi * var + eps)\n",
    "        exponent = math.exp(-(math.pow(x - mean, 2) / (2 * var + eps)))\n",
    "        return coeff * exponent\n",
    "    def _calculate_prior(self, c):\n",
    "        \"\"\" 计算先验概率 \"\"\"\n",
    "        X_where_c = self.X[np.where(self.y == c)]\n",
    "        n_class_instances = X_where_c.shape[0]\n",
    "        n_total_instances = self.X.shape[0]\n",
    "        return n_class_instances / n_total_instances\n",
    "    \n",
    "    def _classify(self, sample):\n",
    "        posteriors = []\n",
    "        for i, c in enumerate(self.classes):\n",
    "            # 计算每一个类别的先验概率 p(y=c)=?\n",
    "            posterior = self._calculate_prior(c)\n",
    "            \n",
    "            for j, params in enumerate(self.parameters[i]):\n",
    "                # 提取每一个类别下的特征值的方差 以及 均值\n",
    "                sample_feature = sample[j]\n",
    "                # 计算高斯密度\n",
    "                likelihood = self._calculate_likelihood(params[\"mean\"], params[\"var\"], sample_feature)\n",
    "                posterior *= likelihood\n",
    "            posteriors.append(posterior)\n",
    "        # 求最大概率对应的类别\n",
    "        index_of_max = np.argmax(posteriors)\n",
    "        return self.classes[index_of_max]\n",
    "    def predict(self, X):\n",
    "        y_pred = []\n",
    "        for sample in X:\n",
    "            y = self._classify(sample)\n",
    "            y_pred.append(y)\n",
    "        return y_pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "data = datasets.load_digits()\n",
    "X = normalize(data.data)\n",
    "y = data.target\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = NaiveBayes()\n",
    "clf.fit(X_train, y_train)\n",
    "y_pred = clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.90250696378830086"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(y_pred,y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实际过程中，朴素贝叶斯用的还是蛮多的，因为计算简单，快。\n",
    "\n",
    "参考\n",
    "\n",
    "1. 李航《统计学习方法》\n",
    "2. 周志华《机器学习》\n",
    "3. [ML_from_scratch](https://github.com/eriklindernoren/ML-From-Scratch)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
